Abstract

In this paper we apply new geometric and combinatorial methods to 
the study of phylogenetic mixtures. The focus of the geometric approach 
is to describe the geometry of phylogenetic mixture distributions for the 
two state random cluster model, which is a generalization of the two state 
symmetric (CFN) model. In particular, we show that the set of mixture 
distributions forms a convex polytope and we calculate its dimension; 
corollaries include a simple criterion for when a mixture of branch lengths 
on the star tree can mimic the site pattern frequency vector of a re- 
solved quartet tree. Furthermore, by computing volumes of polytopes we 
can clarify how "common" non-identifiable mixtures are under the CFN 
model. We also present a new combinatorial result which extends any 
identifiability result for a specific pair of trees of size six to arbitrary pairs 
of trees. Next we present a positive result showing identifiability of rates- 
across-sites models. Finally, we answer a question raised in a previous 
paper concerning "mixed branch repulsion" on trees larger than quartet 
trees under the CFN model. 

Keywords: phylogenetics, model identifiability, mixture model, polytope, 
discrete Fourier analysis 



Molecular phylogenetic inference methods reconstruct evolutionary history 
from sequence data. Many years of research have shown that if data evolves 
according to a single process under certain assumptions then the underlying 
tree can be found given sequence data of sufficient length. For an introduction 
to this literature see [5] or [12] . 

However, it is known that molecular evolution varies according to position, 
even within a single gene [13] . Between genes even more heterogeneity is ob- 
served [ID], though it is not unusual for researchers to concatenate data from 
different genes for inference [TT] . This poses a different challenge for theoretical 
phylogenetics: is it possible to reconstruct the tree from data generated by a 
combination of different processes? 
This question is formalized as follows. The raw data for most phylogenetic
inference techniques is site-pattern frequency vectors, i.e. normalized counts 
of how often certain data patterns occur. If multiple data sets are combined, 
the corresponding site-pattern frequency vectors are combined according to a 
weighted average. In statistical terminology, this is called a "mixture model." 
In the phylogenetic setting, there are various means of generating a site-pattern 
frequency vector given a tree with edge parameters, for example the expected 
frequency vector under a mutation model. 

Definition 1. Assume some way of generating site-pattern frequency data from
trees and edge parameters, i.e. a map $\psi$ from pairs $(T, \xi)$ to site-pattern
frequency vectors. We define a phylogenetic mixture (on h classes) to be any
vector of the form

$$\sum_{i=1}^{h} \alpha_i \, \psi(T_i, \xi_i) \qquad (1)$$

where for each i, $\alpha_i > 0$ and $\sum_i \alpha_i = 1$. When all of the $T_i$ are the same, we
call the phylogenetic mixture a phylogenetic mixture on a tree.

The formal version of our question is now "given a phylogenetic mixture (1),
can we infer the trees $T_i$ and the edge parameters $\xi_i$?"

The answer to this question is certainly "not always." In 1994 Steel et al.
[13] presented the first "non-identifiable" examples, i.e. phylogenetic mixtures 
on a tree such that the underlying tree cannot be inferred from the data. More 
recently, Stefankovic and Vigoda [IB] were the first to explicitly construct such 
examples. Even more recently, Matsen and Steel [7] showed the stronger state- 
ment that a phylogenetic mixture on one tree can "mimic" (i.e. give the same 
site-pattern frequency vector as) an unmixed process on a tree of another topology.


This raises several questions, some of which are answered in this paper for 
the two state models and some generalizations. First, now that we know these 
non-identifiable examples exist, is there some way of describing exactly which 
site-pattern frequency vectors correspond to non-identifiable mixtures? Below 
we note that the set of mixture distributions on a tree of a given topology forms
a convex polytope with a simple description (Proposition 10); thus the
non-identifiable patterns (being a finite intersection of polytopes) form a convex
polytope as well. Now, computing dimensions shows that a "random"
site-pattern frequency vector has a non-zero probability of being non-identifiable,
which raises the question of the relative volumes of a given tree polytope and the
non-identifiable polytopes. This question is answered by computer calculations
for the quartet case in Table 1. We also show that surprisingly well-resolved trees
sit inside the phylogenetic mixture polytope for the star tree (Proposition 22).
This same proposition implies that the internal edge of a quartet tree must be 
long compared to the pendant edges if the corresponding site-pattern frequency 
vector is to be identifiable. 

The second main section focuses on identifiability results for mixtures of two 
trees under various assumptions. These results partially "bookend" the
non-identifiability results of [3111]. The first emphasis for this work is combinatorial,
answering the question (Theorem 23) "if we know all of the splits associated
to the restriction of a pair of trees to taxon subsets of size k, is it possible
to reconstruct the pair of trees?" This gives a theorem which extends any
identifiability result for a specific pair of trees of size six to arbitrary pairs of trees
under a molecular clock (Theorem 25). A different approach shows identifiability
of rates-across-sites models for pairs of trees (Theorem 30). Finally, we show
that if a two class phylogenetic mixture on a single tree mimics the expected site- 
pattern frequency vector of a tree on another topology then the two topologies 
can differ by at most one nearest neighbor interchange. 

1 Geometry of unbounded mixtures on one or 
more topologies 

In this section we show that the space of phylogenetic mixtures under the ran- 
dom cluster model is the convex hull of a finite set of points, i.e. a convex 
polytope. The description of the vertices of the polytope has some interesting 
consequences discussed in Section 1.2. We then compute dimensions, which is
motivated in part by the following theorem of Caratheodory: 

Theorem 2. If X is a d-dimensional linear space over the real numbers, and
A is a subset of X , then every point of the convex hull of A can be expressed as 
a convex combination of not more than d + 1 points of A. 

A proof can be found as statement 2.3.5 of [S]. Therefore if we know that 
the dimension of a certain set of phylogenetic mixture distributions is d, then 
any mixture distribution in that set can be expressed as a phylogenetic mixture 
with no more than d + 1 classes. 
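
As a worked instance of this bound, take quartets under the CFN model and assume the dimension formula $2^{n-1} - 1$ that is established for the CFN model in Proposition 14 later in this section. The symbol $\mathcal{D}_4(1/2, 1/2)$ denotes the set of CFN mixture distributions on four leaves, following the notation introduced there.

```latex
\[
  d = \dim \mathcal{D}_4(1/2, 1/2) = 2^{4-1} - 1 = 7,
  \qquad
  d + 1 = 8 .
\]
```

Under that assumption, every CFN mixture distribution on four taxa can be written with at most eight classes, however many classes were used to generate it.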

We also show that the dimension of those site-pattern frequency vectors 
which can be written as phylogenetic mixtures on the star tree is equal to the 
corresponding dimension for all topologies together. This forms an interesting 
contrast to the genericity results in pQ. 

Convex polytopes are typically specified in one of two ways: by a V-description,
as the convex hull of a finite set of points, or by an H-description, as the bounded
intersection of finitely many half-spaces. Classical algorithms exist to go be- 
tween the two descriptions; these are implemented in the software polymake 0]. 
We will make use of both descriptions; for example, the intersection of poly- 
topes can be easily computed by taking the union of the two sets of inequalities 
describing the half-spaces of the H-descriptions. More introductory material
about polytopes can be found in the texts of Grünbaum [5] and Ziegler [16].

From the phylogenetic perspective, we are interested in the set of site pattern 
frequency vectors which correspond to non-identifiable mixtures. In particular, 
one might ask the question: which site-pattern frequency vectors can be ex- 
pressed as a phylogenetic mixture on any one of a collection of tree topologies? 
At least in the case of the random cluster model, the answer is the intersection of 
the corresponding phylogenetic mixture polytopes. Using polymake and
Proposition 10 this becomes an easy exercise for small trees: simply take the union of
the H-description inequalities for the polytope associated with each topology.
Although the complexity of going from a V-description to an H-description is
still open [6], in practice no fast algorithm is known and so our approach may
not be feasible for large trees. We analyze the polytopes associated with quartet
trees in Section 1.2.
1.1 The random cluster model

In this section we define the random cluster model, which generalizes the two
state symmetric (CFN) and Jukes-Cantor DNA models [3] in two ways: first,
it allows an arbitrary number of states, and second, it allows non-uniform base
frequencies. We will use the common convention that $[k] := \{1, \ldots, k\}$. Assume
k states, and fix a distribution $\pi = (\pi_i : i \in [k])$ as the stationary distribution
on those states. It is always assumed that $\pi_i > 0$ for all $i \in [k]$. We will label
the states of the n taxa $x_1, \ldots, x_n$.

First we define a distribution on site patterns based on partitions. Informally,
we sample once from $\pi$ for each set of the partition, and assign that value to
each element of that set.

Definition 3. Let $\mathcal{S} = \{S_1, \ldots, S_r\}$ be a partition of [n]. We denote by $D_{\mathcal{S}}$ the
probability distribution of the random vector $(x_1, \ldots, x_n)$ obtained by sampling
$y_i$ independently from $\pi$ for each $i \in [r]$ and assigning state $y_i$ to all of the $x_j$
such that $j \in S_i$.

We make the following simple observations: 

Lemma 4. Assume $(x_1, \ldots, x_n)$ is distributed according to $D_{\mathcal{S}}$. Then

• The marginal distribution of each $x_i$ is given by $\pi$.

• For all $S \in \mathcal{S}$ and $i, j \in S$ it holds that $x_i = x_j$.

• The collections of random variables $\{x_S : S \in \mathcal{S}\}$ are mutually
independent, where $x_S = \{x_i : i \in S\}$.

□

Definition 5. For any tree $T = (V, E)$ and function $c : E \to [0, 1]$, define
the random cluster model as follows: For each edge e declare the edge "closed"
with probability c(e) and declare it "open" otherwise. Let $S_1, \ldots, S_r$ denote the
maximal open-edge connected components of V. Now define the partition
$\mathcal{S} = \{S_1, \ldots, S_r\}$ and sample a site pattern from the distribution $D_{\mathcal{S}}$ as in Definition 3.
We use $D_{T,c}$ to denote the induced distribution of state assignments to the leaves.
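
The sampling procedure of Definition 5 can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `sample_site_pattern`, the edge-list tree representation, and the use of a union-find structure to track open-edge components are assumptions of the example, not part of the paper.

```python
import random

def sample_site_pattern(edges, leaves, closed_prob, pi, rng=random):
    """Draw one leaf site pattern from the random cluster model.

    edges:       list of (u, v) vertex pairs describing the tree (hypothetical format)
    leaves:      list of leaf vertices
    closed_prob: dict mapping each edge (u, v) to c(e) in [0, 1]
    pi:          list of stationary state probabilities
    """
    parent = {}

    def find(v):
        # Union-find root lookup with path halving.
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for (u, v) in edges:
        root_u, root_v = find(u), find(v)
        # The edge is closed with probability c(e); an open edge merges
        # the components of its two endpoints.
        if rng.random() >= closed_prob[(u, v)]:
            parent[root_u] = root_v

    cluster_state = {}
    pattern = {}
    for leaf in leaves:
        root = find(leaf)
        if root not in cluster_state:
            # One independent draw from pi per open-edge component.
            cluster_state[root] = rng.choices(range(len(pi)), weights=pi)[0]
        pattern[leaf] = cluster_state[root]
    return pattern
```

States are drawn only for components that contain a leaf, which leaves the induced distribution on leaf patterns unchanged.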

We will also consider the case $k = \infty$ in which different clusters will always
be assigned different states. Note that this particular case is what was referred 
to as the "random cluster model" in [S]. 

The CFN and Jukes-Cantor DNA models are random cluster models with
$\pi$ the uniform distribution on 2 and 4 states respectively. In general, for any
k-state model with uniform stationary frequencies, the corresponding probability
in the random cluster model that an edge is closed is $k/(k - 1)$ times the
probability of mutation along that edge (see, e.g., [12], p. 197).

Definition 6. A binary edge vector is a mapping $g : E \to \{0, 1\}$ taking the
value 1 if the edge is closed and 0 if the edge is open.

Definition 7. Given an edge probability vector $c : E \to [0, 1]$ let $J_c$ be the
associated distribution on binary edge vectors, i.e.

$$J_c(g) = \prod_{e \in E} c(e)^{g(e)} (1 - c(e))^{1 - g(e)}.$$
The following lemma can be checked by substituting in the previous definition.

Lemma 8. If $c_1$ and $c_2$ differ on at most one edge e and if $c = \alpha c_1 + (1 - \alpha) c_2$,
then

$$J_c(g) = \alpha J_{c_1}(g) + (1 - \alpha) J_{c_2}(g).$$

□

Let $x : [n] \to [k]$ be an assignment of states to taxa, i.e. a site pattern.
We can write out the probability of seeing this site pattern under the random
cluster model as

$$P(x) = \sum_{g} J_c(g) \, P(x \mid g) \qquad (2)$$

where $P(x \mid g)$ is the probability of seeing x assuming a binary edge vector g.
Using this we have

Proposition 9. For any tree T and any c, the distribution $D_{T,c}$ is a convex
combination of distributions $D_{T,c_i}$ where each $c_i$ takes only the values 0 or 1.

Proof. Using Lemma 8 and (2) we can proceed stepwise: first we obtain (by
averaging the $D_{T,c_i}$) the set of vectors with the correct first coordinate of c and
arbitrary other coordinates chosen from {0, 1}. Averaging these vectors one can
obtain a set of vectors with the first two coordinates correct, and so on. □

By grouping all of the open-edge-connected subsets into a partition or by
opening and closing edges according to a partition, one has the following proposition.

Proposition 10. Let T be a phylogenetic tree and let c be edge probabilities,
all of whose values are in {0, 1}. Then $D_{T,c} = D_{\mathcal{S}}$ for some partition $\mathcal{S}$ of [n].
On the other hand, for every partition $\mathcal{S}$ of [n] there exists a phylogenetic tree
T and edge probabilities $c \in \{0, 1\}^E$ such that $D_{T,c} = D_{\mathcal{S}}$. □

In fact, the distributions $D_{\mathcal{S}}$ determine the convex geometry of phylogenetic
mixtures.

Theorem 11. The set of phylogenetic mixtures on trees over n leaves is a
convex polytope with vertices

$$\{D_{\mathcal{S}} : \mathcal{S} \text{ a partition of } [n]\}.$$

Proof. The set of phylogenetic mixtures is convex by definition. By
Propositions 9 and 10 it follows that every phylogenetic mixture can be written as a
convex sum of the elements $D_{\mathcal{S}}$. It thus remains to show that we cannot write
$D_{\mathcal{S}}$ as a convex combination of $D_{\mathcal{S}_1}, \ldots, D_{\mathcal{S}_k}$ if $\mathcal{S} \notin \{\mathcal{S}_1, \ldots, \mathcal{S}_k\}$.

Assume by contradiction that

$$D_{\mathcal{S}} = \sum_i \alpha_i D_{\mathcal{S}_i} \qquad (3)$$

where $\alpha_i > 0$ for all i and $\sum_i \alpha_i = 1$.




Claim 12. $\mathcal{S}$ is a refinement of $\mathcal{S}_i$ for all i.

Proof. Suppose $\mathcal{S}$ does not refine $\mathcal{S}_i$. Thus there exist $j \neq l$ such that j and l
belong to the same set in $\mathcal{S}$ but do not belong to the same set in $\mathcal{S}_i$. But this
implies by definition that for $D_{\mathcal{S}}$ we have that $x_j = x_l$ with probability one while
for $D_{\mathcal{S}_i}$ the variables $x_j$ and $x_l$ are independent. Since $\alpha_i > 0$, this is a contradiction. □

We use D[f] to denote the expectation of f under the distribution D. The
following claim concludes the proof of the theorem.

Claim 13. $D_{\mathcal{S}}$ cannot be written as a convex combination of the $D_{\mathcal{S}_i}$.

Proof. By the previous claim, we may assume (3) where $\mathcal{S}$ is now a refinement
of each of the $\mathcal{S}_i$. Let

$$f(x_1, \ldots, x_n) = \sum_{i,j} \mathbf{1}(x_i = x_j).$$

Note that for a general partition $\mathcal{S}'$ it holds that

$$D_{\mathcal{S}'}[f] = |\mathcal{S}'|_2 + (n^2 - |\mathcal{S}'|_2) \, |\pi|_2$$

where $|\mathcal{S}'|_2 = \sum_{S \in \mathcal{S}'} |S|^2$ and $|\pi|_2 = \sum_{x \in [k]} \pi_x^2$. In particular, it follows that
since $\mathcal{S}$ is a refinement of $\mathcal{S}_i$ and $\mathcal{S} \neq \mathcal{S}_i$ for all i, we have $D_{\mathcal{S}_i}[f] > D_{\mathcal{S}}[f]$ for
all i. Plugging this into (3) we obtain a contradiction. The proof of the claim
follows, thereby completing the proof of Theorem 11. □

Now we calculate dimensions. The dimension of a convex polytope is defined 
to be the dimension of its affine hull. We do not give a general dimension formula 
here - instead we will just discuss the two state and infinite state models. We
let $\mathcal{D}_n(1/2, 1/2)$ denote the space of all distributions that can be written as a
convex combination of phylogenetic trees on n leaves under the CFN model, and
let $\mathcal{D}^*_n(1/2, 1/2)$ denote those which can be written using sets of edge lengths
on the star tree with n leaves.

Proposition 14.

$$\dim(\mathcal{D}^*_n(1/2, 1/2)) = \dim(\mathcal{D}_n(1/2, 1/2)) = 2^{n-1} - 1.$$

Proof. We will work with the two-state Fourier transform F as follows. Because
in this case the stationary distribution is uniform, we can work with "collapsed"
site-pattern frequency vectors; we index these by subsets $B \subseteq [n - 1]$ (see, e.g.,
[12]). Now, rather than having the two states be 0 and 1, take them to be $-1$ and
1. Thus, the B-coordinate of a site-pattern frequency vector is the probability
of having B be exactly the set of indices i such that $x_i = -1$. Define for any
$A \subseteq [n - 1]$ and D any distribution on (collapsed) site-pattern frequencies

$$F_A(D) = D\Big[\prod_{i \in A} x_i\Big].$$

To see the connection with the Fourier transform defined by a Hadamard matrix,
pick some $B \subseteq [n - 1]$ and take D' to be the distribution that assigns $-1$ exactly
to the $x_i$ with $i \in B$ (with probability one). Then

$$F_A(D') = (-1)^{|A \cap B|}.$$

This connection demonstrates that F is invertible. Now, since the Fourier
transform is linear and invertible, we can compute the dimension of the $\mathcal{D}$'s by
computing the dimension of their image under the Fourier transform.
By definition we have

$$F_\emptyset[D_{T,c}] = 1, \qquad (4)$$

and it is known that

$$F_A[D_{T,c}] = 0 \text{ for all } A \text{ of odd size} \qquad (5)$$

for all T and c. This last fact can be seen as follows. By Proposition 9 we can
assume that $D_{T,c}$ is given by independent assignment of states (according to $\pi$)
to clusters $S_1, \ldots, S_r$. Because the cardinality of A is odd, at least one of the
$A \cap S_j$ must have odd size, and

$$D\Big[\prod_{i \in A \cap S_j} x_i\Big] = -1 \cdot \frac{1}{2} + 1 \cdot \frac{1}{2} = 0.$$

Equation (5) now follows because the expectation of a product of independent
random variables is the product of the expectations.

It thus follows that equalities (4) and (5) hold for all distributions in $\mathcal{D}$. This
implies that

$$\dim(\mathcal{D}_n(1/2, 1/2)) \leq 2^{n-1} - 1.$$

We show next that

$$2^{n-1} - 1 \leq \dim(\mathcal{D}^*_n(1/2, 1/2)) \leq \dim(\mathcal{D}_n(1/2, 1/2)) \qquad (6)$$

which will imply the proposition. The second inequality follows by containment.

Now we show the first inequality. Given a set S, consider the partition p(S)
that has the set S and a singleton set corresponding to each element of $[n] \setminus S$.
This partition can be achieved on the star tree by declaring all of the pendant
edges leading to elements of S to be open with probability one and all of the
other edges to be closed with probability one. By the same argument as for (5),

$$F_A[D_{p(S)}] = 1 \text{ iff } A \subseteq S \text{ and } |A| \text{ is even,}$$

and $F_A[D_{p(S)}]$ is zero otherwise.

Thus $F_A[D_{p(\emptyset)}]$ is zero for all $A \neq \emptyset$. It follows (using the fact that $F_\emptyset[D_{T,c}] = 1$
for any T, c) that in this case affine dimension coincides with linear dimension.
Therefore to show the first inequality of (6) it suffices to find for every set S
of even order a linear combination of elements of $\mathcal{D}^*_n(1/2, 1/2)$ whose Fourier
coefficient at S is 1 and is 0 at all other sets. An inductive argument shows
that in order to achieve this task, it suffices to show that for every even set S
there exists an element of $\mathcal{D}^*_n(1/2, 1/2)$ whose Fourier coefficient at every even subset of S
is 1 and is zero on all other sets. This is exactly $D_{p(S)}$ as described above. The
proof follows. □

We now analyze the random cluster model for k = 2 when the distribution $\pi$ is
not uniform. Define $\mathcal{D}^*_n(r, 1 - r)$ and $\mathcal{D}_n(r, 1 - r)$ for the case of non-uniform
$\pi = (r, 1 - r)$ analogous to the symmetric (CFN) case for any $0 < r < 1$.

Proposition 15. Let $0 < r < 1$ and $r \neq 1/2$. Then

$$\dim(\mathcal{D}^*_n(r, 1 - r)) = \dim(\mathcal{D}_n(r, 1 - r)) = 2^n - n - 1.$$

Proof. Here we need a variant of the above-described Fourier transform: now
we take the state space to be $\{r - 1, r\}$, with $\pi$ giving the first state with
probability r and the second state with probability $1 - r$. Again F will denote
the Fourier transform so that

$$F_A(D) = D\Big[\prod_{i \in A} x_i\Big].$$

However, there is one subtle difference, which is that because the stationary
distributions are not uniform, we cannot collapse the site-pattern frequency
vectors. Thus the above A is a subset of [n], and the coordinates of D are now
indexed by subsets of [n]. The matrix representation of this transform in the
n = 1 case in the basis $\{\emptyset, \{1\}\}$ is thus

$$X = \begin{pmatrix} 1 & 1 \\ r & r - 1 \end{pmatrix}.$$


For n > 1, the matrix representation is the n-fold Kronecker product of X; it
follows that this transform is invertible for all $0 < r < 1$. As before we calculate
the dimension of the Fourier transform of the $\mathcal{D}$. By definition

$$F_\emptyset[D_{T,c}] = 1,$$

and if A is a singleton then

$$F_A[D_{T,c}] = 0,$$

for all T and c by a similar argument to before. It thus follows that the equalities
above hold for all distributions in the $\mathcal{D}$. This implies that

$$\dim(\mathcal{D}_n(r, 1 - r)) \leq 2^n - n - 1.$$

As before, given a set S, consider the partition p(S) that has the set S and
a singleton set corresponding to each element of $[n] \setminus S$. Then

$$F_S[D_{p(S)}] = r (r - 1)^{|S|} + (1 - r) r^{|S|} = r (r - 1) \big((r - 1)^{|S| - 1} - r^{|S| - 1}\big) \neq 0,$$

since $0 < r < 1$, $r \neq 1/2$ and $|S| > 1$. On the other hand, if A is not a subset of
S then

$$F_A[D_{p(S)}] = 0$$

by an argument as in the previous proof.

As before the affine dimension coincides with the linear dimension. To prove
the corresponding lower bound it suffices to find for every set S of size at least
two a linear combination of elements of $\mathcal{D}^*_n(r, 1 - r)$ whose Fourier coefficient at
S is one and is zero at all other sets. An inductive argument using $D_{p(S)}$ again
concludes the proof. □
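
The dimension counts in Propositions 14 and 15 can be checked by brute force for small n. The following is a sketch under stated assumptions, not the computation used in the paper: it enumerates the distributions $D_{\mathcal{S}}$ on uncollapsed site patterns for every partition of [n] (and separately for the star-tree partitions p(S)), then computes the affine dimension with exact rational arithmetic. All function names are hypothetical.

```python
import itertools
from fractions import Fraction

def set_partitions(items):
    """Yield every partition of a list as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        for i in range(len(partition)):
            # Put `first` into an existing block.
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # Or give `first` its own block.
        yield [[first]] + partition

def star_partitions(n):
    """Partitions p(S): one block S plus singletons (realizable on the star tree)."""
    yield [[i] for i in range(n)]  # S empty: all singletons
    for size in range(2, n + 1):
        for S in itertools.combinations(range(n), size):
            yield [list(S)] + [[i] for i in range(n) if i not in S]

def cluster_distribution(partition, n, pi):
    """Probability vector of D_S over all len(pi)**n site patterns."""
    dist = []
    for x in itertools.product(range(len(pi)), repeat=n):
        p = Fraction(1)
        for block in partition:
            states = {x[i] for i in block}
            # A block contributes pi(state) if constant on the block, else 0.
            p *= (pi[states.pop()] if len(states) == 1 else 0)
        dist.append(p)
    return dist

def matrix_rank(rows):
    """Exact rank by Gaussian elimination over the rationals."""
    rows = [list(r) for r in rows]
    r = 0
    ncols = len(rows[0]) if rows else 0
    for col in range(ncols):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            if rows[i][col] != 0:
                factor = rows[i][col] / rows[r][col]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r

def mixture_dimension(partitions, n, pi):
    """Affine dimension of the convex hull of the D_S for the given partitions."""
    points = [cluster_distribution(p, n, pi) for p in partitions]
    base = points[0]
    return matrix_rank([[a - b for a, b in zip(p, base)] for p in points[1:]])
```

For example, `mixture_dimension(set_partitions(list(range(4))), 4, [Fraction(1, 2)] * 2)` can be compared against $2^{4-1} - 1$, and the same call with `star_partitions(4)` against the star-tree dimension.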

We have just seen how for the CFN model the affine dimension of the space
of phylogenetic mixtures (which has exponential order in n) is much smaller
than the number of extremal points (which is the number of partitions of [n]). In
contrast, for $k = \infty$, the dimension equals the number of extremal points. This
follows from the following proposition.

Proposition 16. The distributions $D_{\mathcal{S}}$ where $\mathcal{S}$ runs over all partitions of [n]
are linearly independent.

Proof. Recall that in the $k = \infty$ model, each set of a partition is assigned a
different state. Thus there is nothing to prove as the probability space we are working
in is the space of partitions of [n]. □

1.2 The phylogenetic mixture polytope for the CFN model 

This section specializes to the case of phylogenetic mixtures under the CFN 
model. As mentioned previously, the CFN model is equivalent to the random 
cluster model with two states and a uniform stationary distribution. Rather 
than probabilities of edges being open and closed, however, it is described in 
terms of "branch lengths." For a given branch length $\gamma$ we will call $\theta = \exp(-2\gamma)$
the "fidelity" of an edge, which ranges between zero (infinite length edge) and
one (zero length edge) for non-negative branch lengths. The closed-edge
probability c for that edge is then $1 - \theta$, which is twice the probability of a state
change along that edge.
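
The relationship between branch length, fidelity, closed-edge probability, and state-change probability under the CFN model can be written out as a short sketch. The function names are hypothetical and chosen only for this illustration.

```python
import math

def fidelity(branch_length):
    """theta = exp(-2 * gamma) for a non-negative branch length gamma."""
    return math.exp(-2.0 * branch_length)

def closed_probability(branch_length):
    """Closed-edge probability c = 1 - theta in the random cluster model."""
    return 1.0 - fidelity(branch_length)

def change_probability(branch_length):
    """Probability of a state change along the edge, which is half of c."""
    return 0.5 * closed_probability(branch_length)
```

A zero-length edge has fidelity one and is never closed, while a very long edge has fidelity near zero and a state-change probability approaching one half.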

Corollary 17. The set of phylogenetic mixtures under the CFN model on a
given tree is a convex set whose extremal points are given (perhaps with
repetition) by branch length assignments to that topology taken from the set $\{0, \infty\}$.

Proof. A branch length of zero corresponds to an edge being open in the random
cluster model with probability one, and a branch length of infinity corresponds
to an edge being closed with probability one. The corollary now follows from
Proposition 9. □

Before analyzing various associated polytopes, we fix some notation and
remind the reader of some facts. Denote site patterns on n taxa using subsets
$A \subseteq [n - 1]$ in the "collapsed" notation as before. Note that one could
equivalently use even sized subsets of [n] via the f(A) below as in [7]. We will use
$p_A$ to denote the probability of a collapsed site pattern A and $q_A$ to denote the
Ath component of the Fourier transform as in [7, 12]. We will denote the
corresponding vectors by p and q. The Hadamard matrices will be denoted H; H is
symmetric and $HH = 2^{n-1} I$ when H is $2^{n-1}$ by $2^{n-1}$. We will denote the inner product
of v and w by $\langle v, w \rangle$ and will often use the fact that $\langle Hv, w \rangle = \langle v, Hw \rangle$. We
will take $e_A$ to be the vector with A'th component one and other components
zero. We will also use the following lemma, from the proof of Theorem 8.6.3
of [12].

Lemma 18. For any subset $A \subseteq \{1, \ldots, n - 1\}$, let

$$f(A) = \begin{cases} A & \text{if } |A| \text{ is even} \\ A \cup \{n\} & \text{otherwise.} \end{cases}$$

Then

$$q_A = \prod_{e \in \mathcal{P}(T, f(A))} \theta_e \qquad (7)$$

where $\mathcal{P}(T, f(A))$ is the unique set of edges which lie in the set of edge-disjoint
paths connecting the taxa in f(A) to each other. □
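
To make the path-set product concrete, here is a small sketch for the quartet tree 12|34 with pendant edges 1 to 4 and internal edge 5. It assumes the standard parity characterization of the path set on a tree (an edge lies on the edge-disjoint paths pairing up f(A) exactly when an odd number of elements of f(A) lie on one side of it); that characterization, the function name, and the dictionary encoding of edges are assumptions of this example rather than statements from the text.

```python
from math import prod

def fourier_coefficient(A, n, edge_sides, theta):
    """q_A as a product of edge fidelities over the path set of f(A).

    edge_sides: dict mapping each edge label to the set of taxa on one side of it
    theta:      dict mapping each edge label to its fidelity
    """
    A = set(A)
    # f(A) pads odd-sized sets with taxon n so that |f(A)| is even.
    f_A = A if len(A) % 2 == 0 else A | {n}
    # Assumed parity rule: keep edges that split f(A) into two odd parts.
    return prod(theta[e] for e, side in edge_sides.items()
                if len(f_A & side) % 2 == 1)

# Quartet 12|34: pendant edges 1..4 and internal edge 5 separating {1, 2} from {3, 4}.
QUARTET_12_34 = {1: {1}, 2: {2}, 3: {3}, 4: {4}, 5: {1, 2}}
```

With this encoding, A = {1} gives f(A) = {1, 4} and the product runs over edges 1, 5 and 4, while A = {1, 2, 3} gives f(A) = {1, 2, 3, 4} and the product runs over the four pendant edges only.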

We will abuse notation by taking $\mathrm{Co}(T_1, \ldots, T_n)$ to denote the convex hull
of phylogenetic mixtures on trees $T_1, \ldots, T_n$ of the same number of leaves.

There are four tree topologies on four taxa: the star tree $T^*$ and the three
resolved trees on four taxa $T_1$, $T_2$, and $T_3$. Thus, up to isomorphism, there are
six convex polytopes of interest in this case, with inclusions as indicated:

$$\begin{aligned}
\mathrm{Co}(T^*) &\subset \mathrm{Co}(T_1) \cap \mathrm{Co}(T_2) \cap \mathrm{Co}(T_3) && (8) \\
&\subset \mathrm{Co}(T_1) \cap \mathrm{Co}(T_2) && (9) \\
&\subset \mathrm{Co}(T_1) && (10) \\
&\subset \mathrm{Co}(T_1, T_2) && (11) \\
&\subset \mathrm{Co}(T_1, T_2, T_3). && (12)
\end{aligned}$$

It will be shown below that the inclusion in (8) is an equality.

From a phylogenetic perspective, polytope (8) represents those site-pattern
frequency vectors which can be realized as a mixture on any of the four
topologies. Polytope (9) contains the distributions from mixtures on two of the
resolved topologies. Polytopes (10), (11), and (12) correspond to mixtures on one,
two, or three resolved topologies.

Polytopes (8) and (9) are of special interest, as they represent mixtures
which are non-identifiable for phylogenetic reconstruction. In Observations 20
and 21 we are able to precisely delineate the set of non-identifiable mixtures;
these generalize the non-identifiable mixture examples of [15]. The drawback
is that the mixtures found here may use as many as eight sets of branch lengths
(recall Theorem 2) rather than just two, and that we may mix trees with extreme
branch lengths.

There is one more polytope which we will investigate, which is that cut out
by inequalities known to be satisfied for phylogenetic mixtures. We will call this
polytope L. Specifically, L is the polytope cut out by $0 \leq q_A \leq 1$ for any A, and
the Fourier transform of the inequalities $0 \leq p_A \leq p_\emptyset$ for any A and the equality
$\sum_A p_A = 1$. Note that the equality is equivalent to $q_\emptyset = 1$. The inequality
$p_A \geq 0$ is equivalent to $\langle e_A, p \rangle \geq 0$ (where $e_A$ is the unit vector defined just
prior to Lemma 18), and this is equivalent to

$$\langle H e_A, q \rangle \geq 0. \qquad (13)$$

The following observation notes further redundancies. 

Observation 19. $\langle H e_A, q \rangle \geq 0$ and $q_A \geq 0$ for every split A implies $q_\emptyset \geq q_A$
for every split A. These same hypotheses also imply that the corresponding
probability distribution on splits is "conservative," i.e. that $p_\emptyset \geq p_A$ for any A.

Proof. Assume there are n taxa. For the first assertion, let J be the $2^{n-1}$ by $2^{n-1}$
matrix with all entries one. Then $J - H$ is a matrix with non-negative entries.
Therefore $\langle H e_A, q \rangle \geq 0$ for every split A implies that $\langle H (J - H) e_A, q \rangle \geq 0$ for
every split A. But $H J e_A = H \mathbf{1} = 2^{n-1} e_\emptyset$ and $HH = 2^{n-1} I$, giving the first
assertion. For the second assertion, note that $H e_\emptyset - H e_A$ is a vector with
non-negative entries, since $H e_\emptyset$ has all entries equal to $+1$ while $H e_A$ has half its
entries equal to $+1$ and half equal to $-1$. Thus $\langle H e_\emptyset - H e_A, q \rangle$ is non-negative
given the assumptions. Thus $\langle e_\emptyset - e_A, p \rangle \geq 0$, which is equivalent to the second
assertion. □

Because of these observations we note that L is the polytope in Fourier
transform space cut out by $q_A \geq 0$ and (13) for each A, as well as $q_\emptyset = 1$.

The following is a simple use of polymake to go from a V-representation to
an H-representation.

Observation 20. $\mathrm{Co}(T^*)$ is defined by $q_\emptyset = 1$, $q_{123} \geq 0$ and the inequalities
(13) and $q_A \geq q_{123}$ for each A. □

Another polymake calculation demonstrates 

Observation 21. The inclusion in (8) is an equality. In phylogenetic terms,
the site-pattern frequency vectors obtainable as a phylogenetic mixture on a tree 
for each of the three resolved quartet topologies are exactly those obtainable as 
a phylogenetic mixture on the four taxon star tree. □ 

We can now see what trees sit inside the star tree polytope $\mathrm{Co}(T^*)$.

Proposition 22. The resolved quartet trees whose site-pattern frequency vectors
are obtainable as phylogenetic mixtures on the four taxon star tree are exactly
those such that the internal branch length is shorter than the sum of the branch
lengths for any pair of non-adjacent pendant edges.

This proposition may come as a surprise for phylogenetics researchers: even
though a given data set may not have any evidence for a particular split, the
data can appear to be exactly that generated on a tree with an internal edge
which is longer than any of the pendant edges. Said another way, in order for the
vector of expected site-pattern frequencies for a quartet tree to be identifiable,
it is necessary that the internal edge be longer than the sum of the branch
lengths for at least one pair of non-adjacent pendant edges.

Proof. Let q denote the Fourier transform of the site-pattern frequency vector
for the tree in question, which we assume without loss of generality to have
topology 12|34. This q can be expressed as a phylogenetic mixture on the star
tree exactly when it satisfies the conditions in Observation 20. Because q is
the Fourier transform of a site-pattern frequency vector generated on a tree, by
the above $q_\emptyset = 1$, $q_{123} \geq 0$, and the inequality (13) is thus satisfied for any A.
Now for each $A \subseteq \{1, 2, 3\}$ we investigate the consequences of the inequality
$q_A \geq q_{123}$. For A = {1}, the inequality becomes by (7)

$$\theta_1 \theta_5 \theta_4 \geq \theta_1 \theta_2 \theta_3 \theta_4 \iff \theta_5 \geq \theta_2 \theta_3.$$

Repeating the process for A = {2}, {1, 3}, {2, 3} and simplifying gives

$$\theta_5 \geq \max\{\theta_1 \theta_3, \theta_1 \theta_4, \theta_2 \theta_3, \theta_2 \theta_4\}.$$

The cases A = {1, 2}, {3} give $1 \geq \theta_3 \theta_4$ and $1 \geq \theta_1 \theta_2$, which are trivially
satisfied, as is the case of A = {1, 2, 3}. Taking logarithms and dividing by $-2$
gives

$$\gamma_5 \leq \min\{\gamma_1 + \gamma_3, \gamma_1 + \gamma_4, \gamma_2 + \gamma_3, \gamma_2 + \gamma_4\}.$$

□
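
The criterion of Proposition 22 is easy to evaluate for a given quartet. The sketch below assumes the quartet 12|34 with pendant edges 1 to 4 and internal edge 5, uses the non-strict form of the inequalities, and checks the condition both in branch lengths and in fidelities $\theta = \exp(-2\gamma)$; the function names are hypothetical.

```python
import math

# Pairs of pendant edges lying on opposite sides of the internal edge of 12|34.
CROSS_PAIRS = [(1, 3), (1, 4), (2, 3), (2, 4)]

def on_star_polytope(gamma):
    """Branch-length form: gamma_5 <= min over cross pairs of gamma_a + gamma_b."""
    return gamma[5] <= min(gamma[a] + gamma[b] for a, b in CROSS_PAIRS)

def on_star_polytope_fidelity(gamma):
    """Equivalent fidelity form: theta_5 >= max over cross pairs of theta_a * theta_b."""
    theta = {e: math.exp(-2.0 * g) for e, g in gamma.items()}
    return theta[5] >= max(theta[a] * theta[b] for a, b in CROSS_PAIRS)
```

For instance, pendant lengths of 0.1 with an internal length of 0.15 satisfy the condition, so that resolved tree is indistinguishable from some star-tree mixture, whereas an internal length of 1.0 with the same pendant lengths does not.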

In the previous section we showed that the dimension of those site-pattern 
frequency vectors which can be realized as a phylogenetic mixture on the star 
tree is equal to the dimension of those pattern probabilities which can be realized 
as an arbitrary phylogenetic mixture. This means that given a sample from 
any nowhere-zero probability distribution on arbitrary phylogenetic mixtures 
there is a non-zero probability of having the sample be realizable from the set of 
mixture distributions on the star tree. However, it does not give any quantitative 
information. Quantitative answers for this and related questions for the uniform 
distribution on site-pattern frequencies can be calculated by using polymake to 
calculate volumes. Results are reported in Table 1.

For example, assume we uniformly choose a random probability distribution 
on patterns obtained by a phylogenetic mixture on a given tree. Then there is a 
probability of approximately 0.57 (≈ 0.173/0.303) that it is non-identifiable, i.e.
that it can be written as a phylogenetic mixture on another tree. More work on 
the relevant geometry is needed to determine if such mixtures pose problems in 
the parameter regimes usually found in phylogenetics. 



polytope                      relative volume (approx.)   absolute volume
Co(T*)                        0.143                       5/1008
Co(T_1) ∩ Co(T_2)             0.173                       13/2160
Co(T_1)                       0.303                       53/5040
Co(T_1, T_2)                  0.566                       11/560
Co(T_1, T_2, T_3)             0.909                       53/1680
L                             1                           5/144

Table 1: Relative volumes of the polytopes described in the text. The absolute
volume is that computed in Fourier transform (i.e. q-) space.
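
The relative volumes in Table 1 are the absolute volumes divided by the volume of L, and the non-identifiability probability quoted in the text is the ratio of two rows. A short sketch with exact fractions reproduces this arithmetic; the dictionary keys and function names are labels invented for this example.

```python
from fractions import Fraction

# Absolute volumes copied from Table 1 (computed in Fourier transform space).
ABSOLUTE_VOLUME = {
    "Co(T*)": Fraction(5, 1008),
    "Co(T1)&Co(T2)": Fraction(13, 2160),
    "Co(T1)": Fraction(53, 5040),
    "Co(T1,T2)": Fraction(11, 560),
    "Co(T1,T2,T3)": Fraction(53, 1680),
    "L": Fraction(5, 144),
}

def relative_volume(name):
    """Volume of the named polytope as a fraction of the volume of L."""
    return ABSOLUTE_VOLUME[name] / ABSOLUTE_VOLUME["L"]

def non_identifiable_fraction():
    """Share of Co(T1) that also lies in Co(T2), under the uniform distribution."""
    return ABSOLUTE_VOLUME["Co(T1)&Co(T2)"] / ABSOLUTE_VOLUME["Co(T1)"]
```

The second function returns 91/159, which is approximately 0.57.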



2 Mixtures of two trees 

In this section we specialize to the case of phylogenetic mixtures on two trees, 
but we generalize the set of mutation models considered. 

2.1 Combinatorics 

In this section we establish a new combinatorial property that allows pairs of 
binary phylogenetic trees to be reconstructed from their induced subtrees of size 
at most six (Theorem 23). The statistical significance of this result is described
in Corollary [231 and the next section. We begin with some definitions. 

Let B(X) denote the collection of binary phylogenetic X-trees (up to
isomorphism) and let B(X, k) denote the subsets of B(X) of size at most k. For
$T \in B(X)$ and $Y \subseteq X$, let $T|_Y$ denote the induced binary phylogenetic Y-tree
obtained from T by restricting the leaf set to Y. For $\mathcal{P} = \{T_1, \ldots, T_j\} \in B(X, k)$
let $\mathcal{P}|_Y := \{T_1|_Y, \ldots, T_j|_Y\} \in B(Y, k)$. We will often stray from standard set
theoretical notation when writing restrictions, for example $T|_{\{a,b,c,d\}}$ will be
written $T|_{abcd}$.

We say that a collection $M$ of subsets of $X$ disentangles $B(X, k)$ if one can
reconstruct any $\mathcal{P}$ from the corresponding collection $\{\mathcal{P}|_Y : Y \in M\}$. This is
equivalent to the condition that for any pair $\mathcal{P}, \mathcal{P}' \in B(X, k)$ we have

$$\mathcal{P} = \mathcal{P}' \iff \mathcal{P}|_Y = \mathcal{P}'|_Y \text{ for all } Y \in M.$$

If in addition, there is a polynomial time (in $|X|$) algorithm that reconstructs
$\mathcal{P}$ from the set $\{\mathcal{P}|_Y : Y \in M\}$ we say that $M$ efficiently disentangles $B(X, k)$.

For example, it is well known that when $k = 1$ the collection $M$ of subsets of
$X$ of size four efficiently disentangles $B(X, 1)$ ($= B(X)$); indeed we may further
restrict $M$ to just those subsets of size four that contain a particular element,
say $x$, of $X$ (see, e.g., Theorem 6.8.8 of [H]). However, the subsets of $X$ of size
four do not suffice to disentangle $B(X, 2)$; moreover, neither do the subsets
of $X$ of size at most five. To establish this last claim, let $X = \{1, 2, \ldots, 6\}$, and
consider the two pairs of trees shown in Figure 1. Then $\{T_1|_Y, T_2|_Y\} = \{T_1'|_Y, T_2'|_Y\}$
for all subsets $Y$ of size at most five, yet $\{T_1, T_2\} \neq \{T_1', T_2'\}$. However, allowing
subsets of $X$ of size at most six allows for the following positive result.




Figure 1: Two pairs of trees which have the same combined set of splits. 
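
The claim that subsets of size at most five cannot separate two pairs of six-leaf trees can be checked mechanically. The sketch below is illustrative only: the four trees are a hypothetical example constructed for this check (they need not be drawn exactly as in Figure 1), each tree is represented by its set of nontrivial splits, and the helper names `make_tree`, `restrict` and `restrict_pair` are assumptions introduced here.

```python
from itertools import combinations

LEAVES = frozenset(range(1, 7))

def make_tree(*sides):
    """Represent a binary tree by its nontrivial splits, each given by one side."""
    return frozenset(
        frozenset({frozenset(s), LEAVES - frozenset(s)}) for s in sides
    )

def restrict(tree, subset):
    """Induced tree on `subset`: restrict every split and drop trivial ones."""
    subset = frozenset(subset)
    kept = set()
    for split in tree:
        a, b = (side & subset for side in split)
        if len(a) >= 2 and len(b) >= 2:
            kept.add(frozenset({a, b}))
    return frozenset(kept)

def restrict_pair(pair, subset):
    """Restriction of a set of trees, as a set of induced trees."""
    return frozenset(restrict(t, subset) for t in pair)

# Hypothetical example: four caterpillars sharing the split 123|456.
T1 = make_tree({1, 2}, {1, 2, 3}, {5, 6})
T2 = make_tree({2, 3}, {1, 2, 3}, {4, 5})
T1p = make_tree({1, 2}, {1, 2, 3}, {4, 5})
T2p = make_tree({2, 3}, {1, 2, 3}, {5, 6})
P, Pp = frozenset({T1, T2}), frozenset({T1p, T2p})

same_union = (T1 | T2) == (T1p | T2p)
agree_up_to_five = all(
    restrict_pair(P, Y) == restrict_pair(Pp, Y)
    for size in range(1, 6)
    for Y in combinations(sorted(LEAVES), size)
)
print(P != Pp, same_union, agree_up_to_five)
```

For this example the two pairs are different, have the same combined set of splits, and agree on every restriction to at most five leaves; only the full six-leaf set tells them apart.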

Theorem 23. $B(X, 2)$ can be efficiently disentangled by the subsets of $X$ of
size at most six.

To establish this result we require the following lemma. 

Lemma 24. Let $T$ be a binary phylogenetic tree on a set $Y$ of seven leaves,
and suppose that $S = \{a, b, c\}$ is a subset of $Y$ of size three. Let $x, y$ be any two
distinct elements of $Y - S$. Then the quartet tree $T|_{S \cup \{x\}}$ is determined by the
collection of quartet trees $T|_q$ as $q$ ranges across the following four values:

(i) $\{a,b,x,y\}$, $\{a,c,x,y\}$, $\{b,c,x,y\}$, and

(ii) $\{a,b,c,y\}$.

Proof. Consider $T|_{abcy}$. Without loss of generality we may suppose that $T|_{abcy} =
ab|cy$. If $T|_{abxy} = ab|xy$ then $T|_{S \cup \{x\}} = ab|cx$. On the other hand, if $T|_{abxy} =
ax|by$ (or $ay|bx$) then $T|_{S \cup \{x\}} = ax|bc$ (or $ac|bx$, respectively). □
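
Since Lemma 24 only involves the five leaves a, b, c, x, y, it can be checked by brute force over the 15 binary trees on those leaves. The sketch below is an illustration, not part of the original argument; it encodes a five-leaf tree by its two cherries, and the function names are assumptions introduced here.

```python
from itertools import combinations

LEAVES = ("a", "b", "c", "x", "y")

def five_leaf_trees():
    """All binary trees on LEAVES, each encoded as a pair of disjoint cherries."""
    pairs = [frozenset(p) for p in combinations(LEAVES, 2)]
    return [(p, q) for p, q in combinations(pairs, 2) if not (p & q)]

def quartet(tree, four):
    """Quartet tree displayed on `four`, as a frozenset of two 2-element sides.

    Dropping one leaf from a five-leaf tree leaves at least one cherry intact,
    and that cherry is one side of the induced quartet split.
    """
    four = frozenset(four)
    for cherry in tree:
        if cherry <= four:
            return frozenset({cherry, four - cherry})
    raise ValueError("not a restriction of a five-leaf tree to four leaves")

determining = [("a", "b", "x", "y"), ("a", "c", "x", "y"),
               ("b", "c", "x", "y"), ("a", "b", "c", "y")]
target = ("a", "b", "c", "x")

seen = {}
consistent = True
for tree in five_leaf_trees():
    key = tuple(quartet(tree, q) for q in determining)
    value = quartet(tree, target)
    if seen.setdefault(key, value) != value:
        consistent = False

print(len(five_leaf_trees()), consistent)
```

The check passes when no two trees agree on the four quartets of types (i) and (ii) while disagreeing on the quartet for S ∪ {x}.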

Proof of Theorem 23. Consider the collection $Q$ of quartets of $X$ that contain
a given element $x \in X$. The quartets in $Q$ are of two types: let $Q_1$ denote the
quartets $q$ in $Q$ for which $T_1|_q = T_2|_q$ (i.e. $\mathcal{P}|_q$ consists of just one tree) and let
$Q_2 = Q - Q_1$. Set $\mathcal{Q}_1 := \{T_1|_q\ (= T_2|_q) : q \in Q_1\}$ and set

$$\mathcal{Q}_2 := \{T_1|_q : q \in Q_2\} \cup \{T_2|_q : q \in Q_2\}.$$

From $\mathcal{Q}_2$ we construct a graph $G(\mathcal{Q}_2)$ that has vertex set $\mathcal{Q}_2$ and that has an
edge between two quartet trees, say $ij|kl$ and $i'j'|k'l'$, precisely if one of the
trees in $\mathcal{P}$ displays both of these quartet trees. Note that $G(\mathcal{Q}_2)$ is the disjoint
union of two cliques. Moreover, for any two quartets $q, q' \in Q_2$, each of the
two trees in $\mathcal{Q}_2$ that correspond to $q$ is adjacent (in $G(\mathcal{Q}_2)$) to precisely one of
the two trees in $\mathcal{Q}_2$ that correspond to $q'$, and the resulting two edges form a
matching for these four vertices.

Now, provided $q \cup q'$ has cardinality at most six we can determine this
matching since we can, by hypothesis, construct $\mathcal{P}|_{q \cup q'}$ which must consist of two
trees, and this pair of trees tells us how to match the two resolutions provided
by $\mathcal{P}$ for $q$ (viz. $\{T_1|_q, T_2|_q\}$) with the two resolutions of $q'$ (viz. $\{T_1|_{q'}, T_2|_{q'}\}$).
In particular we can determine the two edges of $G(\mathcal{Q}_2)$ that connect these four
vertices of $G(\mathcal{Q}_2)$.

We claim that we can also determine (in polynomial time using just $\mathcal{P}|_Y$ for
choices of $Y$ of size at most six) the matching between these four vertices of
$G(\mathcal{Q}_2)$ in the remaining case where $q \cup q'$ has cardinality seven.

Accepting this claim for the moment, this allows us to reconstruct all the edges
of $G(\mathcal{Q}_2)$ and in particular the two disjoint cliques of $G(\mathcal{Q}_2)$, which bipartition
$\mathcal{Q}_2$. Taking the union of each clique with $\mathcal{Q}_1$ provides the pair of subsets
$\{\{T_1|_q : q \in Q\}, \{T_2|_q : q \in Q\}\}$ from which $\{T_1, T_2\}$ can be recovered. Furthermore all
of this can be achieved in polynomial time.

Thus it remains to establish the claim. Take two quartets $q = \{a, b, c, x\}$
and $q' = \{a', b', c', x\}$ from $Q_2$ where we are assuming (since $|q \cup q'| = 7$) that

$$\{a, b, c\} \cap \{a', b', c'\} = \emptyset.$$

We will now invoke Lemma 24 with $S = \{a, b, c\}$ and $Y = q \cup q'$. Assume all
of the four quartets in Lemma 24 are in $Q_1$; by the conclusion of the lemma
the quartet tree $T|_{abcx}$ is uniquely determined. Thus $\{a, b, c, x\} \in Q_1$, which
contradicts our assumption. Therefore at least one of the four quartets of type
(i) or (ii) in Lemma 24 is in $Q_2$.

Suppose there exists a quartet $q^*$ of type (i) in Lemma 24. Then $q \cup q^*$ and
$q' \cup q^*$ both have cardinality at most six (for the latter, note that $y$ in Lemma 24
must be one of the elements $a', b', c'$ as $y \in Y - q$) and so we can determine
the matching. Similarly, since $\{a', b', c', x\} \in Q_2$ we can invoke Lemma 24 with
$S = \{a', b', c'\}$ and the pair $x, y'$ where $y'$ is an element of $Y - S$ different
from $x$. By similar logic, at least one of the quartets satisfying condition (i)
or (ii) in Lemma 24 must also be in $Q_2$ for this choice of $S$. Once again if we
can find a quartet satisfying condition (i) of Lemma 24 we can determine the
matching. A remaining possibility is that in both cases (i.e. for $S = \{a, b, c\}$ and
$S = \{a', b', c'\}$) we can only find a quartet in each case that satisfies condition
(ii) of Lemma 24. Call these two quartets $q_1 = \{a, b, c, y\}$ and $q_1' = \{a', b', c', y'\}$,
respectively. Then the three sets $q \cup q_1$, $q' \cup q_1'$ and $q_1 \cup q_1'$ each have cardinality
at most 6 (for the last case, note that $y'$ is one of $a, b, c$ and $y$ is an element
of $a', b', c'$) and so we can determine the matching for these three pairs. This
allows construction of $T_i|_{q \cup q' \cup q_1 \cup q_1'}$ for $i = 1, 2$ from the corresponding quartet
trees; the matching for the four vertices of $G(\mathcal{Q}_2)$ corresponding to $q \cup q'$ is
then available by restriction. This completes the proof. □

An immediate consequence of Theorem 23 is the following.

Corollary 25. Suppose a model has the property that from an arbitrary mixture 
of processes on two trees with the same leaf set of size six we can reconstruct 
the topology of the two trees. Then the same property applies for phylogenetic 
mixtures on two trees for any leaf set X (of any size greater than six), and by 
an algorithm that is polynomial in $|X|$.

Remarks. Peter Humphries has extended Theorem 23 to obtain analogous
results for $B(X, k)$ for $k > 2$ (manuscript in preparation).

The algorithm for disentangling two trees outlined in the proof of Theorem 23
would run in polynomial time, and a straightforward implementation of the
method would have a run time complexity of $O(|X|^7)$. However, it is quite
possible that a more efficient algorithm could be developed for this problem
(and thereby for Corollary 25).

2.2 Models

Clocklike mixtures

Suppose one has a phylogenetic mixture on two trees $T_1$ and $T_2$. In this section
we are interested in whether one can reconstruct the pair $\{T_1, T_2\}$ (or some
information about this pair) from sufficiently long sequences. In the case where
for each tree there is a stationary reversible Markov process (possibly also with
rate variation across sites), and the (positive, finite) branch lengths of each $T_i$
satisfy a molecular clock, some positive results are possible.

Observation 26. The union of the splits in two trees $T_1$ and $T_2$ on the same
taxon set can be recovered from a phylogenetic mixture on the two trees under a
molecular clock.

To see this we simply consider the function $p : X \times X \to [0, 1]$ defined by
setting $p(x, y)$ to be the probability that species $x$ and $y$ are assigned different
states by the mixture distribution (i.e. $p(x, y)$ is the expected normalized
Hamming distance between the sequences). Then $p = d_1 + d_2$ where (by the
molecular clock assumption) $d_1$ and $d_2$ are monotone transformations of tree
metrics realized by $T_1$ and $T_2$ respectively. By split decomposition theory ([2])
it follows that $\Sigma(T_1) \cup \Sigma(T_2)$ can be recovered from $p$.

Note that $\Sigma(T_1) \cup \Sigma(T_2)$ does not determine the set $\{T_1, T_2\}$, as the two pairs
of trees in Figure 1 show. However this example is somewhat special:

Lemma 27. Suppose $\{T_1, T_2\}$ and $\{T_1', T_2'\}$ are two pairs of binary phylogenetic
trees on the same set $X$ of six leaves, and that

$$\Sigma(T_1) \cup \Sigma(T_2) = \Sigma(T_1') \cup \Sigma(T_2').$$

Then either $\{T_1, T_2\} = \{T_1', T_2'\}$ or the two pairs of trees are as shown in Figure 1
(up to symmetries).

Proof. The proof is simply a case-by-case check of split compatibility graphs. A
split compatibility graph is a graph where each split is represented by a vertex
and an edge connects two splits which are compatible. In this case there are
three nontrivial splits for each tree topology; three splits being realizable on a
tree is equivalent to those three splits forming a clique in the split compatibility
graph. Thus the lemma is equivalent to saying that up to symmetries there is
only one subset of the vertices of the split compatibility graph for six taxa which
can be expressed as two three-cliques in two different ways.

There are two unlabeled topologies on binary trees of six leaves: the caterpillar
(with symmetry group of size eight) and the symmetric tree (with symmetry
group of size 48). First we divide the problem into the case of two caterpillar
topologies, then the case of one caterpillar and one symmetric topology, finally
two symmetric topologies. We label the two types of splits as follows: we call a
split with three taxa on either side (such as 123|456) "type x", and a split with
two taxa on one side and four on the other (such as 12|3456) "type y."

Assume $\{T_1, T_2\} \neq \{T_1', T_2'\}$. In the case of two caterpillar topologies it can
be seen by eliminating cases that $T_1$ and $T_2$ cannot share a split of type y.
Therefore the four type y splits of $T_1$ and $T_2$ must form a square of distinct
vertices in the split compatibility graph. Further elimination shows that the
two trees in Figure 1 are the only ones possible up to symmetries.

The cases involving a symmetric tree are even easier, as the choice of two
splits in a symmetric tree determines the third. In the case of one caterpillar
and one symmetric topology, this implies that there can be at most four type y
splits in $T_1$ and $T_2$. Checking cases quickly eliminates all possibilities. Similar
reasoning deals with the two symmetric topology case, proving the lemma.

□ 
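
The case check in Lemma 27 is small enough to run exhaustively. The following sketch is an independent illustration rather than the authors' procedure: it enumerates all binary six-leaf trees as triples of pairwise compatible nontrivial splits, groups unordered pairs of distinct trees by the union of their splits, and reports the unions that arise from more than one pair. All names are assumptions introduced here.

```python
from collections import defaultdict
from itertools import combinations

LEAVES = frozenset(range(1, 7))

def nontrivial_splits():
    """All splits of six leaves with at least two leaves on each side."""
    found = set()
    for size in (2, 3):
        for side in combinations(sorted(LEAVES), size):
            side = frozenset(side)
            found.add(frozenset({side, LEAVES - side}))
    return list(found)

def compatible(s, t):
    """Compatible splits have a side of one disjoint from a side of the other."""
    return any(not (a & b) for a in s for b in t)

splits = nontrivial_splits()

# A binary tree on six leaves is a set of three pairwise compatible splits.
trees = [
    frozenset(triple)
    for triple in combinations(splits, 3)
    if all(compatible(s, t) for s, t in combinations(triple, 2))
]

# Group unordered pairs of distinct trees by their combined set of splits.
by_union = defaultdict(set)
for t1, t2 in combinations(trees, 2):
    by_union[t1 | t2].add(frozenset({t1, t2}))

ambiguous = {u: pairs for u, pairs in by_union.items() if len(pairs) > 1}

def is_type_x(split):
    """Type x splits have three taxa on either side."""
    return all(len(side) == 3 for side in split)

print(len(splits), len(trees), len(ambiguous))
print({(len(u), sum(map(is_type_x, u)), len(p)) for u, p in ambiguous.items()})
```

Every ambiguous union found this way has the same shape (number of splits, number of type x splits, number of tree pairs), which is what one expects if all of them are relabelings of a single configuration.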

Theorem 28. Suppose that for a reversible stationary model (possibly with rate
variation across sites) there is a method that is able to distinguish a phylogenetic
mixture on trees $T_1$ and $T_2$ from a phylogenetic mixture on trees $T_1'$ and $T_2'$ (see
Figure 1) under branch lengths that satisfy a molecular clock on each tree. Then
from any phylogenetic mixture on two binary trees for a leaf set $X$ with both sets
of branch lengths subject to a clock, one can recover the two trees by an algorithm
that runs in polynomial ($O(|X|^7)$) time.

Proof. Combine Theorem 23, Observation 26 and Lemma 27. For the time
efficiency estimate, the distance matrix can be estimated in $O(|X|^2)$ time, and
the split decomposition can be done in $O(|X|^4)$ time [2]. □

Non-clocklike mixtures 

In [7] it was shown that under the two-state symmetric (CFN) model one can have
a mixture of two processes on one tree giving the same site-pattern frequency
vector as a single process on a different tree. This requires the two sets
of branch lengths being mixed to be quite different and carefully adjusted. For
example, we have:

Corollary 29. If a two class phylogenetic mixture on a tree $R$ has the same
site-pattern frequency vector as a tree of a different topology $S$, then the two sets
of branch lengths cannot be clock-like (even for different rootings of the tree),
nor can one branch length set be a scalar multiple of the other.

Proof. There must be a taxon set $abcd$ such that $R|_{abcd} = ab|cd$ and $S|_{abcd} =
ac|bd$. Using the notation of [7] (also explained in Section 2.3), clocklike mixtures
must have a pair of adjacent taxa (say $a$ and $b$) such that $k_a = k_b$. For one
set of branch lengths to be a nontrivial scalar multiple of another, all of the
pendant $k_i$'s must be either less than or greater than one. Either of these cases
contradicts Proposition 7 of [7]. □

However, one could ask if a more complex phylogenetic mixture on a tree
could mimic an unmixed process on a different tree. Again a molecular clock
rules this out, and for branch lengths that scale proportionately (as in a
rates-across-sites distribution) we now show that identifiability of the underlying tree
still holds.

Theorem 30. Consider two binary phylogenetic trees $T$ and $T'$ on the same
leaf set $X$ of size $n$ generating data under the CFN model. For $T$ suppose we
have a mixture of such processes that can be described by a set of branch lengths
and a distribution $\mathcal{D}$ of rates across sites which generates the same distribution
on site patterns as that produced by an (unmixed) set of branch lengths on $T'$.
Then $T = T'$ and $\mathcal{D}$ is the degenerate distribution that assigns all sites the same
rate.

Proof. It suffices to prove the result for $n = 4$ and $X = \{1, 2, 3, 4\}$, with $T$ the
tree $12|34$, and $T'$ the tree $13|24$. We denote the edge of $T$ (resp. $T'$) that
is incident with leaf $i$ by $e_i$ (resp. $e_i'$) and the interior edge of $T$ (resp. $T'$)
by $e_0$ (resp. $e_0'$). Let $\theta_i' := 1 - 2p(e_i')$ and let $\lambda_i$ denote the branch length
of edge $e_i$ so that the probability of a change along $e_i$ is $\frac{1}{2}(1 - f(-2\lambda_i))$ where
$f(x) = E_{\mathcal{D}}[\exp(\mu x)]$ is the moment generating function for the distribution of
the rate parameter $\mu$ in $\mathcal{D}$.

Then we have (see, e.g., Lemma 8.6.4 and Theorem 8.8.1 of [12]):

$$f(-2\lambda_1 - 2\lambda_2 - 2\lambda_0) = \theta_1'\theta_2' \quad \text{and} \quad f(-2\lambda_3 - 2\lambda_4 - 2\lambda_0) = \theta_3'\theta_4',$$

and thus

$$f(-2\lambda_1 - 2\lambda_2 - 2\lambda_0) \cdot f(-2\lambda_3 - 2\lambda_4 - 2\lambda_0) = \theta_1'\theta_2'\theta_3'\theta_4'.$$

Also,

$$\theta_1'\theta_3'\theta_2'\theta_4' = f(-2\lambda_1 - 2\lambda_2 - 2\lambda_3 - 2\lambda_4).$$

Combining these last two equations and setting $r := -2\lambda_1 - 2\lambda_2$, $s := -2\lambda_3 - 2\lambda_4$:

$$f(r + s) = f(r - 2\lambda_0) f(s - 2\lambda_0) \le f(r) f(s), \qquad (14)$$

with equality precisely if $\lambda_0 = 0$. However, $\exp(\mu x)$ is a monotone function of
$\mu$ for each fixed $x$, and since $r$ and $s$ are both negative, $\exp(\mu r)$ and $\exp(\mu s)$
are both decreasing in $\mu$. It follows that the random variables $\exp(\mu r)$ and
$\exp(\mu s)$ are positively correlated, i.e.

$$f(r + s) \ge f(r) f(s)$$

with equality precisely if $\mathcal{D}$ is a degenerate distribution. Consequently, (14) is
an equality; thus $\mathcal{D}$ is a degenerate distribution and $T' = T$. □

Remark. Theorem 30 extends to provide an analogous result for the uniform
distribution random cluster model on any even number $q = 2r$ of states, since
such a model induces the random cluster model on two states by partitioning
the $2r$ states into two sets, each of size $r$.
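
The correlation step in the proof of Theorem 30 can be illustrated numerically. The sketch below assumes a finite discrete rate distribution (the rates, weights and the values of r and s are made-up numbers for illustration) and evaluates f(r+s) − f(r)f(s), which is the covariance of exp(μr) and exp(μs).

```python
import math

def mgf(rates, probs, x):
    """Moment generating function E[exp(mu * x)] of a discrete rate distribution."""
    return sum(p * math.exp(mu * x) for mu, p in zip(rates, probs))

def gap(rates, probs, r, s):
    """f(r + s) - f(r) f(s), i.e. the covariance of exp(mu r) and exp(mu s)."""
    return mgf(rates, probs, r + s) - mgf(rates, probs, r) * mgf(rates, probs, s)

r, s = -0.4, -0.6  # both negative, as in the proof (r = -2(l1 + l2), s = -2(l3 + l4))

# Hypothetical two-rate distribution: half the sites slow, half fast.
spread = gap(rates=(0.5, 2.0), probs=(0.5, 0.5), r=r, s=s)

# Degenerate distribution: every site has the same rate.
degenerate = gap(rates=(1.0,), probs=(1.0,), r=r, s=s)

print(spread > 0, abs(degenerate) < 1e-12)
```

A strictly positive gap for the non-degenerate distribution is what makes the equality f(r+s) = f(r − 2λ_0) f(s − 2λ_0) ≤ f(r) f(s) impossible unless all sites share one rate.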

2.3 Mixed branch repulsion: larger trees 

In this section we find results analogous to those in [7] for trees larger than
quartet trees. The main result is that two class phylogenetic mixtures on a tree
can only mimic a tree which is topologically one nearest neighbor interchange
away from the original tree.

Let $\ell(T)$ denote the set of leaves of a given tree $T$. We will write $R \rightarrowtail S$
to mean that there exists a two class phylogenetic mixture on $R$ which gives
exactly the same site-pattern frequency vector as some branch length set on a
tree of topology $S$ under the CFN model. Of course, if $R \rightarrowtail S$ then $\ell(R) = \ell(S)$.

Theorem 31. Assume $R$ and $S$ are two topologically distinct trees on at least
four leaves such that $R \rightarrowtail S$. Then $R$ and $S$ differ topologically by one nearest
neighbor interchange (NNI). Furthermore, assume the NNI partitions $\ell(R)$ into
the sets $X_1, \ldots, X_4$. Then $R|_{X_i} = S|_{X_i}$ for any $i$ (equality as rooted trees with
branch lengths).

For this proof we will draw notation and several ideas from the proof of
the main result of [7]. For a four taxon tree with taxon labels 1 through 4
we will label the pendant edges with the corresponding numbers. We will
write the quartet tree with the $ab|cd$ split as simply $ab|cd$. Given two sets of
branch lengths on a given tree we use $k_i$ to denote the ratio of the fidelities (see
Section 1.2) of the two branch lengths for the edge $i$. We will constantly use
the simple fact that if the edge of an induced subtree consists of a sequence of
edges then the induced $k_i$ for that edge consists of the product of the $k_i$'s for the
sequence of the edges (this holds because the fidelities are multiplicative along
a path, and therefore their ratios are also).
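
The multiplicativity of the ratios along a path is easy to check numerically. This is a minimal sketch, assuming the CFN fidelity of an edge with branch length λ is exp(−2λ) and that k is the fidelity under the first branch length set divided by that under the second; both the orientation of the ratio and the branch lengths below are assumptions made for illustration.

```python
import math

def fidelity(branch_length):
    """CFN fidelity of an edge, assumed here to be exp(-2 * branch length)."""
    return math.exp(-2.0 * branch_length)

def k_ratio(length_first, length_second):
    """Ratio of fidelities of one edge under two branch length sets."""
    return fidelity(length_first) / fidelity(length_second)

# Hypothetical branch lengths of three consecutive edges under the two sets.
path_first = [0.10, 0.25, 0.40]
path_second = [0.30, 0.05, 0.20]

# Product of the per-edge ratios along the path ...
product_of_edge_ratios = math.prod(
    k_ratio(a, b) for a, b in zip(path_first, path_second)
)
# ... versus the ratio for the single merged edge of the induced subtree.
ratio_for_merged_edge = k_ratio(sum(path_first), sum(path_second))

print(math.isclose(product_of_edge_ratios, ratio_for_merged_edge))
```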

Lemma 32. The quartet splits $ab|cd$, $ac|bd$ and $ad|bc$ are invariant under the
action of the Klein four group

$$K_4 = \{1, (ab)(cd), (ac)(bd), (ad)(bc)\}.$$

□
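
Lemma 32 can be confirmed directly by applying each element of the Klein four group to each quartet split. The sketch below uses plain dictionaries as permutations; the helper names are assumptions introduced here.

```python
LEAVES = "abcd"

def quartet_split(left):
    """The split of {a, b, c, d} with `left` on one side."""
    side = frozenset(left)
    return frozenset({side, frozenset(LEAVES) - side})

def apply(perm, split):
    """Image of a split under a permutation of the leaf labels."""
    return frozenset(frozenset(perm[x] for x in side) for side in split)

def double_transposition(p, q, r, s):
    """The permutation (p q)(r s)."""
    return {p: q, q: p, r: s, s: r}

identity = {x: x for x in LEAVES}
klein_four = [
    identity,
    double_transposition("a", "b", "c", "d"),
    double_transposition("a", "c", "b", "d"),
    double_transposition("a", "d", "b", "c"),
]
splits = [quartet_split("ab"), quartet_split("ac"), quartet_split("ad")]

invariant = all(apply(g, s) == s for g in klein_four for s in splits)

# For contrast, a single transposition such as (a b) does move ac|bd.
swap_ab = {"a": "b", "b": "a", "c": "c", "d": "d"}
moved = apply(swap_ab, quartet_split("ac")) != quartet_split("ac")

print(invariant, moved)
```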

The following lemma can be checked by hand.

Lemma 33. Given numbers $k_a, k_b, k_c$, there exists a $\sigma \in K_4$ such that

$$k_{\sigma(a)} > k_{\sigma(b)} \quad \text{and} \quad k_{\sigma(a)} > k_{\sigma(c)}.$$

□

The following lemma is a rephrasing of Proposition 3 of [7]:

Lemma 34. If $ab|cd \rightarrowtail ab|cd$ then the following two statements must be
satisfied:

• $k_a = k_b$ or $k_c = k_d$
• $k_a = k_b^{-1}$ or $k_c = k_d^{-1}$.

□

Lemma 35. If $ab|cd \rightarrowtail ac|bd$ then

• There is some element $\sigma \in K_4$ such that $k_{\sigma(a)} > k_{\sigma(c)} > k_{\sigma(d)} > k_{\sigma(b)}$

• none of $k_a, \ldots, k_d$ are equal to one

• either exactly one or exactly three of $k_a, \ldots, k_d$ are greater than one

• $k_a \neq k_b^{-1}$ and $k_c \neq k_d^{-1}$.

Proof. Each item in the list is from Proposition 7 of [7] with the exception of
the last one. By Lemma 32 we can relabel such that $k_a > k_c > k_d > k_b$. Let
$f(x) = x - x^{-1}$. Note that $f(x^{-1}) = -f(x)$, $f(x)$ is positive for $x > 1$ and strictly
increasing for $x > 0$. By equation (12) of [7],

$$f(k_a) f(k_d) + f(k_b) f(k_c) > 0.$$

Assume first that $k_a > k_c > k_d > 1 > k_b$. Then the above properties of $f$ imply
the following deductive chain:

$$f(k_a) f(k_c) + f(k_b) f(k_c) > 0,$$

$$f(k_a) + f(k_b) > 0,$$

$$f(k_a) > f(k_b^{-1}),$$

implying $k_a \neq k_b^{-1}$. The case where $k_a > 1 > k_c > k_d > k_b$ is similar, as is the
proof that $k_c \neq k_d^{-1}$. □

The proof of Theorem 31 rests on the following observation.

Lemma 36. If $R \rightarrowtail S$ then $R|_F \rightarrowtail S|_F$ for any $F \subseteq \ell(R)$. □

We will use this lemma by restricting taxon sets of the larger tree to sets of
size five, then analyzing for which ordered pairs $(R, S)$ of five leaf subtrees it
holds that $R \rightarrowtail S$. There are 225 ordered pairs of five leaf trees; however, in
the following lemma we show that symmetry considerations reduce the
number of cases of interest to four. For ease of notation, we will write the five leaf
subtree $W_{abcde}$ as shown in Figure 2.
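
The two counts quoted above can be reproduced by enumeration. The sketch below assumes that $W_{abcde}$ denotes the five-leaf tree with cherries {a, b} and {c, d} and with e attached in the middle, which is the reading consistent with the symmetry generators (12), (34) and (13)(24) used for $W_{12345}$; it then counts the orbits of the possible trees S under those symmetries. The function names are assumptions introduced here.

```python
from itertools import combinations

LEAVES = (1, 2, 3, 4, 5)

def five_leaf_trees():
    """All binary five-leaf trees, each encoded by its two disjoint cherries."""
    pairs = [frozenset(p) for p in combinations(LEAVES, 2)]
    return {frozenset({p, q}) for p, q in combinations(pairs, 2) if not (p & q)}

def act(perm, tree):
    """Relabel the leaves of a tree by a permutation given as a dict."""
    return frozenset(frozenset(perm[x] for x in cherry) for cherry in tree)

# Generators of the symmetries of W_12345 (cherries {1,2} and {3,4}, middle 5).
generators = [
    {1: 2, 2: 1, 3: 3, 4: 4, 5: 5},  # (12)
    {1: 1, 2: 2, 3: 4, 4: 3, 5: 5},  # (34)
    {1: 3, 3: 1, 2: 4, 4: 2, 5: 5},  # (13)(24)
]

def orbits(trees, gens):
    """Partition `trees` into orbits under the group generated by `gens`."""
    remaining, result = set(trees), []
    while remaining:
        frontier = [remaining.pop()]
        orbit = set(frontier)
        while frontier:
            tree = frontier.pop()
            for g in gens:
                image = act(g, tree)
                if image not in orbit:
                    orbit.add(image)
                    frontier.append(image)
        remaining -= orbit
        result.append(orbit)
    return result

trees = five_leaf_trees()
orbit_list = orbits(trees, generators)
print(len(trees) ** 2, len(orbit_list), sorted(len(o) for o in orbit_list))
```

With R fixed as $W_{12345}$, each orbit of S corresponds to one case that has to be examined.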

Lemma 37. Given trees on five leaves $R$ and $S$, the question of whether $R \rightarrowtail S$
or not is equivalent to the question of whether one of the following is true:

    $W_{12345} \rightarrowtail W_{12345}$    (15)

    $W_{12345} \rightarrowtail W_{13245}$    (16)

    $W_{12345} \rightarrowtail W_{12354}$    (17)

    $W_{12345} \rightarrowtail W_{13254}$    (18)



Proof. It can be assumed that $R$ is $W_{12345}$ by renumbering. Note that the
symmetries of a five leaf tree are generated by (12), (34), and (13)(24) on the
tree $W_{12345}$. A combination of these symmetries applied to $R$ and renumbering
means that these symmetries can then be applied to the labels of $S$ while still
assuming that $R$ is $W_{12345}$. Using these symmetries $S$ can be assumed to be
either $W_{abcd4}$ or $W_{abcd5}$. There are six such trees; a further application of the
symmetries shows that the cases of $S = W_{13254}$ and $S = W_{23154}$ are equivalent,
as are $S = W_{13245}$ and $S = W_{14235}$. □

Figure 2: Definition of $W_{abcde}$.

Lemma 38. Mixture (16) is impossible, i.e. $W_{12345} \not\rightarrowtail W_{13245}$.

Proof. Assume the contrary, and that the $k_i$'s are labeled as in Figure 2. By (clear
extensions of) Lemmas 32 and 33 we can assume that $k_1 > k_2$ and $k_1 > k_3$
on these trees. By restricting the taxon set to 1234, and noting that by
Lemma 36 $12|34 \rightarrowtail 13|24$, we have $k_1 > k_3 > k_4 > k_2$ and that $k_3$ and $k_4$ are
either both greater than one or both less than one by Lemma 35. By restricting
to 1235, it is clear that $k_5 \neq 1$. Assume $k_5 < 1$. Restricting the taxon set to
2345 means that $25|34 \rightarrowtail 24|35$; by testing elements of $K_4$ in Lemma 35 and
using the fact that $k_3$ and $k_4$ are either both greater than one or both less than
one and that $k_5 < 1$, one must have $k_2 k_6 > k_4 > k_3 > k_5$. This contradicts
the above statement that $k_3 > k_4$. The case where $k_5 > 1$ follows similarly by
restricting to 1345. □

Lemma 39. Mixture (18) is impossible, i.e. $W_{12345} \not\rightarrowtail W_{13254}$.

Proof. Assume the contrary. First restrict to the taxon set 1345. For this taxon
set $15|34 \rightarrowtail 13|45$, showing by Lemma 35 that $k_3 \neq k_4$, $k_3 \neq k_4^{-1}$, and $k_5 \neq 1$.
Second, restrict to taxon set 2345. For this taxon set the induced mixture
is $25|34 \rightarrowtail 25|34$, therefore we can apply Lemma 34. Because $k_3 \neq k_4$ and
$k_3 \neq k_4^{-1}$, it must be true that $k_2 k_6 = k_5$ and $k_2 k_6 = k_5^{-1}$. This contradicts the
fact that $k_5 \neq 1$. □

Therefore we are left with mixtures (15) and (17), implying the following
corollary.

Corollary 40. Assume $R \rightarrowtail S$ for two five-leaf trees $R$ and $S$. Then $R$ and $S$
share a nontrivial split. □

We now present two more lemmas which will be used in the proof of
Theorem 31. Given rooted trees $R$ and $S$ let $R - S$ denote the unrooted tree obtained
by joining the roots of $R$ and $S$ together with an edge.

Lemma 41. Assume $R_1 - R_2 \rightarrowtail S_1 - S_2$, $\ell(R_1) = \ell(S_1)$, and all of the $k$'s for
the edges in $R_1$ are one. Then $R_1 = S_1$ (equality with branch lengths).

Proof. Add a taxon $e$ at the root of $R_1$ (resp. $S_1$) to obtain the unrooted tree
$R_U$ (resp. $S_U$). We will show that the between-leaf distance matrices for $R_U$
and $S_U$ are the same, which implies that $R_U = S_U$ and thus $R_1 = S_1$. Pick $c$
and $d$ distinct in $\ell(R_2)$. Pick an arbitrary $a$ and $b \in \ell(R_1)$ and restrict to the
taxon set $abcd$. By Proposition 4 of [7], the pairwise distance between $a$ and $b$
in $R_1$ and $S_1$ (and thus in $R_U$ and $S_U$) will be the same. To show that distances
from taxa $a \in \ell(R_1)$ to the root taxon $e$ are the same in $R_U$ and $S_U$, repeat the
same process but for any $a$ choose $b$ such that the MRCA of $a$ and $b$ in $R_1$ is
the root of $R_1$. Another application of Proposition 4 of [7] in this case proves
the proposition. □

Lemma 42. If $R_1 - R_2 \rightarrowtail S_1 - S_2$, $\ell(R_1) = \ell(S_1)$ and $\Sigma(R_2) \neq \Sigma(S_2)$ then
$R_1 = S_1$ (equality with branch lengths).

Proof. For $x, y \in \ell(R_i)$, let $C_y(x)$ be the set of edges in the path from $x$ to the
MRCA of $x$ and $y$. Define

$$\varphi_y(x) = \prod_{e \in C_y(x)} k_e.$$

This takes the place (for induced subtrees) of a single $k_e$. The idea of the proof
is to use the previous lemma by showing that $k_e$ for any edge $e$ in $R_1$ is one.
However, by induction it is enough to show that $\varphi_y(x) = \varphi_x(y) = 1$ for any
$x, y \in \ell(R_1)$.

Since $\Sigma(R_2) \neq \Sigma(S_2)$ but $\ell(R_2) = \ell(S_2)$ there exists a subset $\{a, b, c\} \subset
\ell(R_2)$ such that $R_2$ restricted to the taxon set $abc$ is the tree $(ab)c$, while $S_2$
restricted to $abc$ is $(ac)b$. Pick any $x, y \in \ell(R_1)$. First restrict to taxon set $abcx$,
for which $ab|cx \rightarrowtail ac|bx$. By Lemma 35, $\varphi_b(a) \neq \varphi_a(b)$ and $\varphi_b(a) \neq [\varphi_a(b)]^{-1}$.
Now restrict to the taxon set $abxy$, for which $ab|xy \rightarrowtail ab|xy$. By Lemma 34,
$\varphi_y(x) = \varphi_x(y)$ and $\varphi_y(x) = [\varphi_x(y)]^{-1}$, implying that each $\varphi$ is one. The lemma
now follows. □

The final lemma allows for the combination of splits; it is a special case of
Lemma 2 of [8]. We present an argument here for completeness.

Lemma 43. Let $T$ be a phylogenetic tree. If $A \cup \{x\} | B \in \Sigma(T|_{A \cup B \cup \{x\}})$ and
$A \cup \{y\} | B \in \Sigma(T|_{A \cup B \cup \{y\}})$ then $A \cup \{x, y\} | B \in \Sigma(T|_{A \cup B \cup \{x, y\}})$.

Proof. First we note that if $A|B \in \Sigma(T|_{A \cup B})$ then one of $A | B \cup \{x\}$ or $A \cup \{x\} | B$
is contained in $\Sigma(T|_{A \cup B \cup \{x\}})$, otherwise the restriction of $T|_{A \cup B \cup \{x\}}$ to $A \cup B$
cannot contain the split $A|B$.

Applying this fact to the two splits $A \cup \{x\} | B$ and $A \cup \{y\} | B$ implies either
the conclusion of the lemma or that $A \cup \{x\} | B \cup \{y\}$ and $A \cup \{y\} | B \cup \{x\}$ are both
in $\Sigma(T|_{A \cup B \cup \{x, y\}})$. This latter option is excluded by split compatibility. □

Proof of Theorem 31. Because $R$ and $S$ are topologically distinct yet have the
same number of leaves, there must be at least one split in $R$ which is not in
$S$. Say this split is given by the edge $e_0$. The edge $e_0$ must induce a nontrivial
split, and therefore assign $e_1, \ldots, e_4$ and $T_1, \ldots, T_4$ such that $R$ can be drawn
as in Figure 3.

Figure 3: Notation used in the proof of Theorem 31.

Pick any $i \in \{1, \ldots, 4\}$. We claim that the split induced by edge $e_i$ is in
$\Sigma(S)$. If $|\ell(T_i)| = 1$ then there is nothing to prove, so assume that $|\ell(T_i)| \geq 2$.
Construct a five-leaf tree by choosing two leaves $a, b$ from $\ell(T_i)$ and also leaves
$c, d, e$: one from each of the other three $T_j$. Because the split induced by $e_0$ is not
in $S$ by hypothesis, it also cannot be in $S|_{abcde}$. An application of Corollary 40
now implies that the split induced by $e_i$ must be in $\Sigma(S|_{abcde})$. This is true for
each such choice of $abcde$; all of these choices combined via Lemma 43 show that
the split induced by the edge $e_i$ is in $\Sigma(S)$.

Four applications of Lemma 42 now prove the theorem. □

The following proposition says that the sort of mixture described in
Theorem 31 is possible (assuming the main result of [7]). It is a simple general
fact.

Proposition 44. Let $T_1, \ldots, T_4$ be rooted trees and $R$ and $S$ two trees on the
taxon set $\{1, 2, 3, 4\}$. Let $\bar{R}$ and $\bar{S}$ be the trees obtained from $R$ and $S$ by
attaching tree $T_i$ in place of taxon $i$. Now if $R \rightarrowtail S$ then $\bar{R} \rightarrowtail \bar{S}$.

Proof. Let the vector $y$ represent the state vector for the terminal taxa on $R$
and $S$ and let $x_i$ represent the state vector for the tree $T_i$. Let $p^T_\gamma(z)$ mean
the probability of state vector $z$ on a tree $T$ with branch lengths $\gamma$; $\gamma$ will be
omitted if understood. The statement $R \rightarrowtail S$ means exactly that there exist
$\gamma_1, \gamma_2, \gamma_3$ and $\alpha$ such that

$$\alpha p^R_{\gamma_1}(y) + (1 - \alpha) p^R_{\gamma_2}(y) = p^S_{\gamma_3}(y)$$

for any state vector $y$. We observe that

$$p^{\bar{W}}(x_1, \ldots, x_4) = \sum_y p^W(y) \prod_{i=1}^{4} p^{T_i}(x_i | y_i)$$

for $W = R, S$, where $p^{T_i}(x_i | y_i)$ is the probability of state vector $x_i$ assuming
the root of $T_i$ is in state $y_i$. This implies

$$\alpha p^{\bar{R}}_{\bar{\gamma}_1}(x_1, \ldots, x_4) + (1 - \alpha) p^{\bar{R}}_{\bar{\gamma}_2}(x_1, \ldots, x_4)
= \sum_y p^S_{\gamma_3}(y) \prod_{i=1}^{4} p^{T_i}(x_i | y_i)
= p^{\bar{S}}_{\bar{\gamma}_3}(x_1, \ldots, x_4),$$

where the $\bar{\gamma}_j$ are simply the $\gamma_j$ along with the branch lengths of the $T_i$.

□

For completeness we also record when a two class phylogenetic mixture on 
a tree can mimic a tree of the same topology under the CFN model. 

Proposition 45. If a two class phylogenetic mixture on a tree mimics a tree
of the same topology under the binary symmetric model, then all branch lengths
between the two sets must be the same with the possible exception of those for a
quartet of adjacent edges sitting inside the tree.

Proof. Assume a counter-example to Proposition 45, i.e. that there exists a tree
$R$ with two branch length sets which differ by more than a quartet of adjacent
edges but which mix to mimic a tree of the same topology $S$ under the binary
symmetric model. Therefore, there exists a partitioning of $R$ into subtrees $A$,
$B$, and $C$ meeting at a node such that there is an edge in each of $A$ and $B$ which
differs in terms of branch length between the two sets. Note that if two branch
length sets differ on a nontrivial rooted tree, then by induction one can find
an induced rooted subtree of size two which differs in terms of branch length
between the two branch length sets. Therefore there must be an induced rooted
subtree of size two in each of $A$ and $B$ which differs in terms of branch length
between the two branch length sets. Number the taxa thus chosen from $A$ with
1 and 2, and the taxa chosen from $B$ with 3 and 4. Label an arbitrary taxon
from $C$ with 5. Now consider the 5-taxon tree induced by restricting the
taxon set to 1 through 5. Label the edges as in Figure 2 and assign (induced)
$k_i$'s as before.

From the above we can assume (perhaps after renumbering) that $k_1 \neq 1$
and $k_3 \neq 1$. By restricting $R$ to the taxon set 1234 we have by Lemma 34 that
$k_1 = k_2^{-1}$ and $k_3 = k_4$ (perhaps after renumbering). Because $k_3 \neq 1$, we have
$k_3 \neq k_4^{-1}$. Thus using Lemma 34 by restricting to 1534 we have $k_1 k_6 = k_5^{-1}$ and
by restricting to 2534 we have $k_2 k_6 = k_5^{-1}$. Therefore $k_1 = k_2 = 1$, which is a
contradiction. □

3 Conclusion 

We have presented a number of new results which help to clarify when
non-identifiable phylogenetic mixtures may pose a problem for reconstruction.
However, the message isn't completely straightforward. The first section shows
that the space of site-pattern frequency vectors for phylogenetic mixtures on 
many quartet trees contains a relatively large non-identifiable region. Further- 
more, this non-identifiable region under the CFN model contains site-pattern 
frequencies for resolved trees with substantial internal branch lengths. Yet,
these spaces were constructed using specific trees of extreme branch lengths,
raising the question of whether corresponding results hold for more reasonable
parameter regimes and "random" sets of trees which one might find from data.
Furthermore, we wonder if it is possible to find simple $H$-descriptions of the
phylogenetic mixture polytope for larger star trees.

On the other hand, the second section shows generally that phylogenetic 
mixtures on just two trees may not pose so much of a problem. In particular, 
our results make progress towards showing that clocklike two class phylogenetic 
mixtures may be identifiable under further assumptions. We also show that 
pairs of trees under CFN rates-across-sites mixtures are identifiable. Finally, 
we show that two class phylogenetic mixtures on a tree cannot "change" the 
topology too much. 

In general, many interesting questions remain and we look forward to seeing
further progress in this field.